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Abstract 

Purpose : Monte Carlo methods are considered the gold standard for dosimetric computations in 
radiotherapy. Their execution time is however still an obstacle to the routine use of Monte Carlo 
packages in a clinical setting. To address this problem, a completely new, and designed from the 
ground up for the GPU, Monte Carlo dose calculation package for voxelized geometries is proposed: 
GPUMCD. 

Method : GPUMCD implements a coupled photon-electron Monte Carlo simulation for energies 
in the range 0.01 MeV to 20 MeV. An analogue simulation of photon interactions is used and a 
Class II condensed history method has been implemented for the simulation of electrons. A new 
GPU random number generator, some divergence reduction methods as well as other optimization 
strategies are also described. GPUMCD was run on a NVIDIA GTX480 while single threaded 
implementations of EGSnrc and DPM were run on an Intel Core 17 860. 

Results : Dosimetric results obtained with GPUMCD were compared to EGSnrc. In all but one 
test case, 98% or more of all significant voxels passed a gamma criteria of 2%-2mm. In terms of 
execution speed and efficiency, GPUMCD is more than 900 times faster than EGSnrc and more 
than 200 times faster than DPM, a Monte Carlo package aiming fast executions. Absolute execution 
times of less than 0.3 s are found for the simulation of IM electrons and 4M photons in water for 
monoenergetic beams of 15 MeV, including CPU-CPU memory transfers. 

Conclusion : GPUMCD, a new GPU-oriented Monte Carlo dose calculation platform, has been 
compared to EGSnrc and DPM in terms of dosimetric results and execution speed. Its accuracy and 
speed make it an interesting solution for full Monte Carlo dose calculation in radiation oncology. 
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14 I. INTRODUCTION 



15 Monte Carlo packages such as EGS4 [T], EGSnrc |2l |3], EGS5 0] and PENELOPE ^ E] 

16 have been extensively compared to experimental data in a large array of conditions, and 

17 generally demonstrate excellent agreement with measurements. EGSnrc, for instance, was 

18 shown to "pass the Fano cavity test at the 0.1% level" Despite this accuracy, Monte 

19 Carlo platforms are mostly absent from routine dose calculation in radiation therapy due to 

20 the long computation time required to achieve sufficient statistical significance. 

21 Monte Carlo simulations are generally considered the gold standard for benchmarking 

22 analytical dose calculation approaches such as pencil-beam based computations {HHTO] and 

23 convolution-superposition techniques [TT| [T2] . Such techniques typically use precomputed 

24 Monte Carlo data to incorporate physics-rich elements in the dose calculation process. These 

25 semi-empirical methods were developed as a trade-off between accuracy and computation 

26 time and as such do not match the level of accuracy offered by Monte Carlo simulations, 

27 especially in complex heterogeneous geometries where differences of up to 10% were re- 

28 ported |T3l - [l5] . 

29 Fast Monte Carlo platforms have been developed for the specific purpose of radiotherapy 

30 dose calculations, namely VMC [T6HT8] and DPM [19]. These packages make some sacrifices 

31 to the generality and absolute accuracy of the simulation by, for example, restricting the 

32 energy range of incoming particles. This assumption allows simpler treatment of certain 

33 particle interactions which are less relevant within the limited energy range considered. 

34 Gains in efficiency of around 50 times can be achieved by these packages when compared to 

35 general-purpose solutions. 

36 This paper presents GPUMCD (GPU Monte Carlo Dose), a new code that follows the 

37 fast Monte Carlo approach. GPUMCD relies on a relatively new hardware platform for 

38 computations: the GPU (Graphics Processing Unit). The GPU is gaining momentum in 

39 medical physics pOVEH] as well as in other spheres [21H2H] where high-performance computing 

40 is required. A layer-oriented simulation based on the MCML Monte Carlo code for photon 

41 transport has already been ported to the GPU [29]. The photon transport algorithm from 

42 PENELOPE has also been ported to the GPU |30j for accelerations of up to 27x. DPM [19] 

43 has been ported to the GPU with excellent dosimetric results compared to the CPU version, 

44 as expected, but with relatively low (5-6. 6x) acceleration factors [31] . 
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45 A comparison of GPUMCD to the work by Jia et al. must be made cautiously. The work 

46 of Jia et al. is a port of an existing algorithm that is not, to the best of our knowledge, 

47 designed for the specific nature of the GPU. Notable differences are found between the two 

48 approaches regarding the treatment of secondary particles and the electron-photon coupling. 

49 These differences will be exposed in Sec. |II C 3 The code presented in this article is the first 

50 attempt of a complete rewrite of a coupled photon-electron Monte Carlo code specifically 

51 designed for the GPU. New techniques for the memory management of particles as well 

52 as efforts to reduce the inherent divergent nature of Monte Carlo algorithms are detailed. 

53 Divergence in GPU programming arises from conditional branching, which is not optimal 

54 in stream processing where predictable execution is expected for optimal performances. No 

55 new sampling methods used in Monte Carlo simulations were introduced in GPUMCD; all 

56 theoretical developments of cross section sampling methods have been adapted from their 

57 description in general-purpose Monte Carlo package manuals (and references therein) listed 

58 previously. However, their actual implementation is original, as it is specifically tailored 

59 for parallel execution on graphics hardware, and the code has not been taken from existing 

60 Monte Carlo implementations. 

61 The paper is structured as follows: Sec. [TT] introduces the physics principles as well as 



62 the hardware platform and implementation details. Sec. Ill presents the results in terms of 

63 dose and calculation efficiency of GPUMCD compared to the EGSnrc and DPM platforms. 



64 Finally, a discussion of the reported results is presented in Sec. IV 



65 II. MATERIAL AND METHODS 

66 A. Photons simulation 

67 Photon transport can be modeled with an analogue simulation, i.e. every interaction 

68 is modeled independently until the particle leaves the geometry or its energy falls below a 

69 certain energy level referenced as Pcut from now on. This allows for an easy implementation 

70 of the transport process. 

71 GPUMCD takes into account Compton scattering, the photoelectric effect and pair pro- 

72 duction. Rayleigh scattering can be safely ignored for the energy range considered (10 keV- 

73 20 MeV). Considering a homogeneous phantom, the distance to the next interaction, noted 
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74 s, can be sampled from the following expression: 

^ = -^ln(0. (1) 

75 where ( is a random variable uniformly distributed in (0,1] and n{E) is the value of the total 

76 attenuation coefficient, given by 



P V^compton ~\~ ^photo ~l~ ^pair) i (2) 

77 where the a's are total cross section values for the corresponding interactions, A^^ is Avo- 

78 gadro's number, A is the molecular weight and p is the density. 

79 To eliminate the need for a distinct geometry engine in heterogeneous situations, 

80 GPUMCD employs the Woodcock raytracing algorithm [32] in which the volume can be 

81 considered homogeneously attenuating by introducing a fictitious attenuation coefficient, 

82 corresponding to a fictitious interaction which leaves the direction and energy of the particle 

83 unchanged, in every voxel x: 

Kx) fict = fJ-max - (3) 

84 where fimax is the attenuation coefficient of the most attenuating voxel inside the volume. 

85 The distance to the next interaction is therefore always sampled using fimax- The sampling 

86 of the interaction type, once at the interaction position, is then performed taking into 

87 consideration this fictitious interaction type. This method is not an approximation, it leads 

88 to the correct result even in arbitrarily heterogeneous geometries. 

89 Compton scattering is modeled with a free electron approximation using the Klein-Nishina 

90 cross section [33j . The energy and direction of the scattered photon and secondary electron 

91 are sampled according to the direct method derived by Everett et al. |3lj. Binding effects and 

92 atomic relaxation are not modeled. These approximations are reasonable when the particle 

93 energy is large compared to the electron binding energy which is the case for radiation 

94 therapy [3S]. 

95 The photoelectric effect is modeled by once again ignoring atomic relaxation. Addition- 

96 ally, shell sampling is also ignored and all electrons are assumed to be ejected from the 

97 K-shell. The sole product of the interaction is a new electron in motion with total energy 

98 Eeiec = Ephot- The angular sampling of the electron is selected according to the Sauter 

99 distribution [36] following the sampling formula presented in Sec. 2.2 of the PENELOPE 
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100 manual [5]. The simulation of the photoelectric effect will become less accurate when the 

101 energy of the photon is in the order of the characteristic X-ray radiation energy of the mate- 

102 rial since characteristic X-rays are not produced. Therefore, at the previously stated lower 

103 energy range threshold of 10 keV, photons are still tracked and simulated but the accu- 

104 racy of the simulation of the photoelectric effect will be dependent on the material where 

105 the simulation is taking place. The binding energies being low in organic compound, this 

106 approximation is considered acceptable. 

107 Relatively crude approximations are employed when simulating pair production events. 

108 First and foremost, no positrons are simulated in this package. The pair production event 

109 instead generates two secondary electrons. The relatively low probability of pair production 
no events at the energies considered as well as the additional low probability of in-fiight an- 

111 nihilation of the positron make this approximation suitable jT9|. Triplet production is also 

112 not modeled. The energy of the incoming photon is trivially split between the two electrons 

113 using a uniformly distributed random number. The angle of both electrons is sampled using 

114 the algorithm presented in Eq. 2.1.18 of the EGSnrc manual |37] . 

115 B. Electron simulation 

116 Because of the much larger number of interactions experienced by a charged particle be- 

117 fore it has deposited all of its energy, analogue simulations cannot be used practically. A 

118 class II condensed history method is used here, as defined by Berger [38], in which inter- 

119 actions are simulated explicitly only if they cause a change in orientation greater than 9c 

120 or a change in energy larger than E^. Below these thresholds, the interactions are modeled 

121 as taking part of a larger condensed interaction, the result of which is a major change in 

122 direction and energy. Above the thresholds, interactions, usually called hard or catastrophic 

123 interactions, are modeled in an analogue manner similar to the simulation of photons. 

124 1. Hard interactions 

125 GPUMCD simulates inelastic collisions as well as bremsstrahlung in an analogue manner, 

126 as long as the changes to the state of the particle are greater than the previously discussed 

127 threshold. 
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128 For inelastic collisions, a free electron approximation is used and no electron impact 

129 ionization nor spin effects are modeled. The M0ller cross section is used to describe the 

130 change in energy and direction of the scattered and knock-on electrons. The sampling 

131 routine presented in Sec. 2.4.3.i of the EGSnrc manual [37j is used. 

132 During a bremsstrahlung event, the energy of the created photon is sampled as presented 

133 in Sec. 2.4.2.ii of the EGSnrc manual. Approximations to the angular events are employed : 

134 the angle of the electron is unchanged and the angle of the photon is selected as ^ = rrioC^/E 

135 where E is the energy of the incoming electron. 

136 2. Multiple scattering 

137 The class II condensed history method requires the selection of an angle for the multiply- 

138 scattered electron after a step, as discussed in the next subsection. To this end, the formu- 

139 lation presented by Kawrakow and Bielajew j39] and Kawrakow [2] is used. The required 

140 data are imported from the msnew . dat file packaged with EGSnrc. 

141 3. Electrons transport 

142 The transport of electrons is less straightforward than photon transport because of the 

143 high probability of soft collisions, which are regrouped in a single step between two hard inter- 

144 actions (collision or bremsstrahlung events). This condensed history method introduces an 

145 unphysical (but mathematically converging) factor for the length of the multiple-scattering 

146 step. Between consecutive hard collisions, which are separated by a distance sampled with an 

147 electron version of Eq. ([T]), a number of multiple-scattering events will occur. In GPUMCD, 

148 a simple step-length selection (henceforth e^) is used and can be summarized as 

149 where d^ox is the distance to the next voxel boundary, Cg-max is a user-supplied upper bound 

150 of step length and Chard is the distance to the next hard interaction. 

151 The distance to the next hard interaction can once again be sampled with the help of 

152 the Woodcock raytracing algorithm. For electron transport, however, the total attenuation 

153 coefficient, IJ,{E), is in fact fJ,{E, s) where the dependence on distance has been made explicit. 
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154 The dependence on distance is due to the fact that electrons will lose energy due to sub- 
165 threshold interactions between two successive hard interactions. In this application, the 

156 approximation is made that the attenuation coefficient is decreasing with respect to the 

157 particle energy. It is not rigorously true, as shown in [2j, but allows for an easier treatment 

158 since ^max{E) can be selected with E being the energy at the beginning of the electron step. 

159 Energy losses are experienced by the electron during the soft scattering events that are 

160 modeled by the multiple-scattering step. This energy loss is accounted for with the con- 

161 tinuously slowing down approximation (CSDA). This approximation states that charged 

162 particles are continuously being slowed down due to the interactions they go through and 

163 that the rate of energy loss is equal to the stopping power 5, given by: 

sm ^ -f . (5) 

164 The implementation of a class II condensed history method only accounts for sub-threshold 

165 interactions in the CSDA. The use of restricted stopping powers, L, is therefore required: 

L{E,K,^t)= / J:{E,E')dE' (6) 
Jo 

166 where E') is the total macroscopic cross section and Kcut is the catastrophic interaction 

167 kinetic energy threshold. Since this release of GPUMCD is e^-centric, the energy loss of a 

168 given step is computed with 

AE= r L{s)ds. (7) 
Jo 

169 The evaluation of the integral can be reduced to 

AE = L {Eo - L{Eo)s/2) (8) 

170 which is EGSnrc's Eq. 4.11.3. 

171 Between two consecutive hard collisions, the electron does not follow a straight line. 

172 Bielajew's alternate random hinge method [5] builds on PENELOPE's random hinge method 

173 to handle the angular deflection and lateral displacement experienced by an electron during 

174 an electron step. The random hinge method can be described as follows: the electron is 

175 ffist transported by (where C, is again a uniformly distributed random number), at which 



176 point the electron is rotated by the sampled multiple- scattering angle described in Sec. |II B 2 

177 and the energy loss is deposited. The electron is then transported for the remaining distance 

178 equal to (C — 1)6^. Bielajew's refinement of the algorithm involves randomly sampling the 
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179 angular deflection either before or after the electron has deposited the energy associated 

180 with the step. 



181 C. GPU implementation 

182 In this section, different aspects of the GPU implementation are detailed. GPUMCD is 

183 built with the CUDA framework from NVIDIA. A general description of the GPU building 



184 blocks and memory levels can be found in |10]. Sec. II CI presents the random number 



185 generator used in this work while Sec. II C 2 describes the memory management. Finally 



186 Sec. II C 3 addresses the SIMD divergence issue and presents solutions as well as other 



187 optimizations used in GPUMCD. 

188 1. Random number generator 

189 A pseudo random number generator (PRNG) is already available in the NVIDIA SDK 

190 (Software Development Kit) for GPU computing. However, it uses a lot of resources which 

191 would then be unavailable for the rest of the Monte Carlo simulation. For this reason, a 

192 new lightweight PRNG based on the work of Marsaglia [H] was implemented. We use a 

193 combined multiply-with-carry (MWC) generator, with recurrence of the form 

Xn+i = {a*Xn + c„) mod (6) (9) 

194 where a is the multiplier and b the base. The carry, c, is deflned by: 

Cn+l = [ ^ J. (10) 

195 The carry c is naturally computed with integer arithmetic and bases b of 2^^ or 2^^. The 

196 choice of the multiplier a is not arbitrary: the multiplier is chosen so that a6 — 1 is safe 

197 prime. The period of the PRNG with such a multiplier is on the order of {ab — l)/2. This 

198 PRNG has the advantage of using a small amount of registers. By combining two of these 

199 generators for each thread, the PRNG uses two 32 bits values for the multipliers and two 

200 32 bits values for the current state. Ten integer operations (3 shifts, 2 ands, 2 mults and 

201 3 adds) are required to generate one new number. It has been shown to pass all tests but 

202 one of the TestUOl suite [42j, a program designed to test the quality of pseudo random 

203 numbers, and to achieve 96.7% of the peak integer bandwidth. 



204 2. Memory management 



205 GPUMCD is composed of four main arrays: a list of electrons and a list of photons, 

206 an array to store composition identifiers and another for density values corresponding to a 

207 numerical phantom. During the initialization phase, a call to the initParticle function 

208 is made which fills the correct particle array (depending on the source particle type) and 

209 according to the source type {e.g. with a parallel beam source all particles have a direction 

210 vector of (0,-1,0)). The particle arrays are allocated only once, with size MAXSIZE. The 

211 choice of MAXSIZE is graphics card dependent as the particle arrays occupy the better part 

212 of the global memory, the largest pool of memory on the graphics card. For instance, on 

213 a 1 Gb card, MAXSIZE could be set to 2^°. Only a fraction of the particle array will be 

214 filled with source particles, to leave room for secondary particle creation. If for instance 

215 MAXSIZE/d particles are generated, then each primary particle can generate an average of d 

216 secondary particle without running out of memory. The choice of d is therefore important; 

217 it is energy dependent and to a lesser extent geometry dependent. For example, with a 

218 monoenergetic 15 MeV beam of photons on a 32 cm deep geometry, a value of ci = 8 was 

219 found to be sufficient to accommodate all secondary particles. A global counter of active 

220 particles is kept for both electrons and photons and every time a new particle is added 

221 the counter is atomically incremented, until it is equal to MAXSIZE after which secondary 

222 particles are discarded. If the choice of d and MAXSIZE are such that the number of particles 

223 generated is lower than MAXSIZE, then no particles are discarded. Atomic operations are a 

224 mechanism that ensure that two parallel threads cannot write the same memory location at 

225 the same time, therefore discarding one thread's modification. Atomic operations are not 

226 ideal on parallel hardware but these two integer values, the current number of particles of 

227 both types, should be efficiently cached on GFIOO class hardware through the use of the 

228 GPU-wide 768 Kb of L2 cache. After the initialization phase is completed, the simulation 

229 starts by calling the function to simulate the initial particle type. Subsequently, secondary 

230 particles of the other type are generated and put inside the particle array for that type. 

231 The other type of particle will then be simulated, and so on until a relatively low number of 

232 particle is left in the stack. This number is user-selectable and could be set to to simulate 

233 every particle. Since a simulation pass always comes with some overhead, it could also be 

234 set to a non-zero value to avoid the launching of a simulation pass with few particles to 
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235 simulate which could have a negligible impact on the final distribution. If this impact is in 

236 fact negligible of course depends on the threshold selected. Photons and electrons are never 

237 simulated in parallel but instead the particles of the other type are placed in their respective 

238 array and wait there until the array with the current type of particle is exhausted. This, in 

239 turn, eliminates the divergence due to the photon-electron coupling. Every time a call to a 

240 simulation function for electrons or photons is issued, one can consider that every particle of 



241 that type is simulated and that the counter for that type of particle is reset to 0. Sec. |II C 3 

242 will show some refinements related to the management of particles. 



243 The other two main arrays, one of composition identifier and one of density values, are 

244 stored as 3D textures since they are used to represent the volume of voxels. 3D textures 

245 reside in memory but they can be efficiently cached for 3D data locality. Every time a 

246 particle is moved, the composition and density of the current particle voxel are fetched from 

247 the 3D textures. 



248 Cross section and stopping power data are stored in global memory in the form of a ID 

249 texture. The cross section and stopping power data are preinterpolated on a regular grid 

250 with a value at every 1 keV. 

251 The shared memory usage is relatively limited in this application. For the electrons 

252 simulation, some specific composition attributes are required {e.g. Zy, Zq, etc., in Tab. 

253 2.1 of EGS5) and are stored in shared memory. These data values have not been placed in 

254 constant memory since that type of memory is best used when all threads of a warp access 

255 the same address, which cannot be guaranteed here since these composition attributes are 

256 material dependent. Two particles in different materials would therefore request the constant 

257 memory at two different addresses, resulting in serialization. No other use for shared memory 

258 has been found since the application is by nature stochastic. For instance, we cannot store 

259 a portion of the volume since there is no way to know where all the particles of one thread 

260 block will be. Similarly, we cannot store a portion of the cross-section data since there is no 

261 way to know in which material and with which energy all the particles of one thread block 

262 will be. 
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263 3. Divergence reduction and other optimization strategies 

264 Every multiprocessor (MP) inside the GPU is in fact a SIMD processor and the SIMD 

265 coherency has to be kept at the warp level which is the smallest unit of parallelism and 

266 is composed of 32 threads. The Monte Carlo simulation being inherently stochastic, it is 

267 impossible to predict which path a given particle will take and it is therefore impossible to 

268 regroup particle with the same fate into the same warps. When a warp is divergent, it is 

269 split into as many subwarps as there are execution paths, leading to a performance penalty. 

270 Divergence can be seen when, e.g. two particles of the same warp do not require the same 

271 number of interactions before exhaustion or do not interact in the same way. Some software 

272 mechanisms can be employed to reduce the impact of divergence. 

273 The first mechanism consists in performing a stream compaction after N simulation 

274 steps, where is user-defined and corresponds to a number of interactions for photons and 

275 to a number of catastrophic events for electrons. For example, if one particle requires 20 

276 interactions to complete and the others in the warp require less than 5, then some scalar 

277 processors (SP) in the MP will be idle while they wait for the slow particle to finish. This 

278 first mechanism then artificially limits the number of simulation steps a particle can undergo 

279 during one pass of the algorithm. After this number of steps, every particle that has been 

280 completely simulated is removed from the list of particles and the simulation is restarted 

281 for another steps, until every particle has been completely simulated. The removal of 

282 particles is not free and therefore it is not clear if this technique will have a positive effect 

283 on execution time. The stream compaction is accomplished with the CHAG library [43] . 

284 A second mechanism, named persistent thread by its creators and pool in GPUMCD, is 

285 taken from the world of graphics computing [45] . In this approach, the minimum number of 

286 threads to saturate the GPU is launched. These threads then select their workload from a 

287 global queue of particles to be simulated. Once a thread is done with one particle, it selects 

288 the next particle, until all have been simulated. This is once again to reduce the impact of 

289 the imbalance in the number of interactions per particle. 

290 All the GPU code uses single precision floating point numbers as they offer a significant 

291 speed improvement over double precision floating point numbers on graphics hardware. The 

292 work of Jia et al. |31) showed that no floating point arithmetic artifacts were introduced for 

293 this type of application. 
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294 A notable difference between tfiis work and tfie work by Jia et al. is tfie way tfie secondary 

295 particles are treated. In the work by Jia, a thread is responsible for its primary particle as 

296 well as every secondary particles it creates. This can be a major source of divergence because 

297 of the varying number of secondary particles created. On the other hand, GPUMCD does 

298 not immediately simulate secondary particles but instead places them in their respective 

299 particle arrays. After a given pass of the simulation is over, the arrays are checked for newly 

300 created secondary particles and if secondary particles are found, they are simulated. This 

301 eliminates the divergence due to the different number of secondary particles per primary 

302 particles. Additionally, since Jia et al. use a single thread per primary particles, a thread 

303 may be responsible for the simulation of both electrons and photons which can in turn be a 

304 major source of divergence. For example, at time t, thread A may be simulating a secondary 

305 electron while thread B a secondary photon. In other words, they simulate both photons 

306 and electrons at the same time. Since both of these particles most likely have completely 

307 different code path, heavy seriahzation will occur. On the other hand, since GPUMCD 

308 stores secondary particles in arrays to be simulated later, no such divergence due to the 

309 electron-photon coupling occurs. 

310 Finally, GPUMCD can be configured to take advantage of a multi-GPU system. The 

311 multi-GPU approach is trivial for Monte Carlo simulations: both CPUs execute the 

312 initParticle function and simulate their own set of particles. After the simulations, 

313 the two resulting dose arrays are summed. Linear performance gains are expected for 

314 simulations when there are enough particles to simulate to overcome the overhead penalty 

315 resulting from copying input data to two graphics card instead of one. 

316 These optimization mechanisms can be turned on or off at the source code level in 

317 GPUMCD. 

318 D. Performance evaluation 

319 The performance evaluation of GPUMCD is twofold: a dosimetric evaluation against 

320 EGSnrc/DOSXYZnrc and an efficiency evaluation against EGSnrc and DPM, the later being 

321 a fairer comparison since DPM was designed for the same reason GPUMCD was developed, 

322 i.e. fast MC dose calculations. 

323 All settings for DPM were set to default, notably Kcut = 200 keV and Pcut = 50 keV. 
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324 Most parameters for DOSXYZnrz (unless otherwise noted) were also set to default, notably 

325 ESTEPE=0.25, C,max=^-5- In EGSnrc/DOSXYZnrc, the boundary crossing algorithm used 

326 was PRESTA-I and the electron step algorithm was PRESTA-II, atomic relaxation was turned 

327 off as well as bound Compton scattering. Values of Pcut and Ecut were set to 0.01 MeV and 

328 . 7 MeV respectively. The same 64^ grid with 0.5 cm^ voxels is used for all simulations on 

329 all platforms. These values of spatial resolution and number of voxels are insufficient for a 

330 clinical calculation with patient specific data; however, they were judged adequate for the 

331 benchmarking study conducted here. Preliminary results suggest that no loss of efficiency 

332 occurs between GPUMCD and other platforms as the number of voxels is increased. 

333 The uncertainty of Monte Carlo simulation results varies as 1/ where is the number 

334 of histories. All simulations ran for visualization purposes on GPUMCD were conducted with 

335 enough primary particles to achieve a statistical uncertainty of 1% or less. The same number 

336 of particles were then generated in DOSXYZnrc. The graphs are shown without error bars 

337 since they would simply add clutter to the results. For execution time comparisons, 4 million 

338 particles were generated and simulated for photon beams and 1 million particles for electron 

339 beams. All sources were modeled as a monoenergetic parallel beam. All dose distributions 

340 were normalized with respect to D^ax- The efficiency measure, e, was used to evaluate the 

341 performance of the different implementations: 

(11) 



342 where s is the statistical uncertainty and T the computation time. Only voxels with a dose 

343 higher than 0.2 ■ Dmax were considered for the statistical uncertainty, yielding the following 

344 expression for the uncertainty ^5] : 

1 J2 (12) 



345 where 



ACS. = ^^•'i;^'^"''' • (13) 



346 for A^ the number of histories simulated, (D) the mean of the random variable D and A''o.2 

347 the number of voxel with dose higher than 0.2Dmax- 
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348 For the dosimetric evaluation, a gamma index was used to compare the two dose distri- 

349 butions [10]. The gamma index for one voxel x is defined as 

7(x) = mm{r(x, x')}V{x'}, (14) 

350 and 

,,^^,,^^w^^;m^, (15) 

351 where — is the distance between voxels x and x', D{x) is the dose value of voxel 

352 s, A(i is the distance tolerance value and AD is the dose tolerance value. The criteria for 

353 acceptable calculation is defined for a gamma index value below or equal to 1.0. 

354 All simulations were run on the same PC comprising an Intel Core 17 860 and a NVIDIA 

355 GeForce GTX480 graphics card. DPM and DOSXYZnrc do not natively support multi- 

356 core architectures and have not been modified. GPUMCD, unless running in multi-GPU 

357 configuration, uses only one processor core. 



358 III. RESULTS 



359 Several slab-geometry phantoms are described in Sec. Ill A . In Sec. IIIB , the execution 

360 times and gains in efficiency are reported. A multi-GPU implementation is also tested and 

361 corresponding acceleration factors are presented. 



362 A. Dosimetric results 

363 For slab geometries, central axis percent depth dose (PDD) curves as well as dose profiles 

364 at different depths or isodose curves are presented. Gamma indices are calculated in the 

365 entire volume of voxels and the maximum and average values are reported. The number of 

366 voxels with a gamma value higher than 1.0 and 1.2 are also detailed in order to evaluate 

367 discrepancies in dose calculation. The gamma criteria used is set to 2% and 2 mm which is a 

368 generally accepted criteria for clinical dose calculation [47] . In every gamma index summary 

369 table: 1) 7™"^ is the maximum gamma value for the entire volume of voxel; 2) 7™^ is the 

370 average gamma value for voxels with D > cDmax', 3) is the number of voxels with 

371 D > cDmax ; 4) > g is the number (proportion) of voxels with D > cD^ax and 7 > 
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372 The first slab geometry is a simple water phantom. GPUMCD treats homogeneous 

373 phantoms as heterogeneous phantoms; the particles are transported as if evolving inside 

374 a heterogeneous environment and therefore there is no gain in efficiency resulting from 

375 homogeneous media. For this geometry, the results for an electron beam and a photon beam 

376 are presented in Fig. [T]and gamma results in Tab. [1} 





FIG. 1. FDD (top) and profile (bottom) of 15 MeV electron (left) and photon (right) beams on 
water. 



Farticles 


^max 


To. 2 


SL'o.2 


S70.2 > 1.0 S70.2 > 1.2 


Electrons 


1.24 


0.29 


1750 


37 (2%) 2 (-0%) 


Fhotons 


1.27 


0.18 


7500 


143 (2%) 19 (-0%) 



TABLE I. Gamma criteria summary for a 15 MeV beam on water. 

377 The second geometry is composed of a lung box inside a volume of water. It is designed 

378 to demonstrate the performance of GPUMCD with lateral and longitudinal disequilibrium 
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379 conditions. For electrons, the lung box is 4 cm long, 4.5 cm wide and starts at a depth of 

380 3 cm; for photons the lung box is 6.5 cm long, 4.5 cm wide and starts at a depth of 5.5 cm. 

381 In both cases, the box is centered on the central axis. For this geometry, FDD and isodose 

382 plots are presented in Fig. [2] for the electron and photon beams while the gamma results are 

383 presented in Tab. |TT} 




z (cm) z (cm) 

FIG. 2. FDD (top) and isodose (bottom) of a (left) 15 MeV electron beam on water with a 4 cm 
long, 4.5 cm wide box of lung at a depth of 3 cm and (right) 15 MeV photon beam on water with 
a 6.5 cm long, 4.5 cm wide box of lung at a depth of 5.5 cm. 



Particles 


^max 


a.vg 

To. 2 


SL>0.2 


S7o,2 > 1.0 S70.2 > 1.2 


Electrons 


1.40 


0.32 


2366 


46 (2%) 2 (0%) 


Photons 


1.30 


0.19 


7522 


122 (2%) 19 (-0%) 



TABLE II. Gamma criteria summary for a 15 MeV beam on water with a lung box. 

384 The third geometry is composed of soft tissue, bone and lung. The slabs, for electrons, 

385 are arranged as such: 2 cm of soft tissue, 2 cm of lung, 2 cm of bone followed by soft tissue. 
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386 For photons the geometry is: 3 cm of soft tissue, 2 cm of lung, 7 cm of bone followed by soft 

387 tissue. The results for the electron and photon beams are presented in Fig. |3] and gamma 

388 results in Tab. IIIII 





FIG. 3. FDD (top) and profile (bottom) of a (left) 15 MeV electron beam on a phantom composed 
of 2 cm of soft tissue, 2 cm of bone, 2 cm of lung followed by soft tissue and (right) 15 MeV photon 
beam on a phantom composed of 3 cm of soft tissue, 2 cm of lung, 7 cm of bone followed by soft 
tissue. 



Farticles 


^max 


To. 2 




S7o,2 > 1.0 S70.2 > 1.2 


Electrons 


1.84 


0.34 


1700 


138 (8%) 54 (3%) 


Fhotons 


1.31 


0.22 


7220 


12 (-0%) (0%) 



TABLE III. Gamma criteria summary for a 15 MeV beam on a phantom with layers of soft tissue, 
bone and lung. 
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389 B. Execution time and efficiency gain 



In this section, the absolute execution times as well as the overall speed and efficiency of 
GPUMCD is evaluated and compared to EGSnrc and DPM. In the following tables, Georrii 
is the water phantom, Geom2 is the water-lung phantom and Geom^ is the tissue-lung-bone 
phantom. Tegsutc and Topm are respectively the EGSnrc and DPM execution times. For 
the efficiency measurement, e, the fastest execution time of GPUMCD (with or without the 



395 optimization presented in Sec. II C 3) has been used. The acceleration factor and efficiency 

396 improvement, namely A and A^, also use the fastest time for GPUMCD. The acceleration 

397 factor is defined as 

Tref 



A 



T, 



398 and the efficiency improvement as 



A. 



GPUMCD 



^GPUMCD 



-ref 



(16) 



(17) 



399 Tab. |IV| presents the absolute execution times for all simulations with both electron and 

400 photon beams. 



Particles 


Geomi Geom2 Geom^ 




(seconds) 


Electrons 


0.118 0.147 0.162 


Photons 


0.269 0.275 0.366 



TABLE IV. Absolute execution times, in seconds, for GPUMCD in the three geometries for IM 
electrons and 4M photons using 15 MeV monoenergetic beams. 



401 Tab. |V]presents the acceleration results for photon and electron beams where 4M photons 

402 and IM electrons are simulated. 



403 Tab. VI presents the results of the divergence reduction methods and acceleration strate- 



404 gies presented in Sec. II C 3 for electron and photon beams. In the following table A. 



compi 



405 Apool represent the acceleration factors with respect to the base configuration execution 

406 time (Tbase), with the stream compaction or pool divergence reduction methods enabled, 

407 respectively. 
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Photons 


Electrons 




DPM EGSnrc 


DPM EGSnrc 


A 

Geomi 

A, 


453 1161 
515 1174 


246 1510 
291 1740 


A 

Geom2 

A, 


433 1058 
474 1305 


210 1264 
237 1403 


A 

Geom^ 

A, 


203 947 
220 1050 


225 1231 
242 1359 



TABLE V. Acceleration factors for IM 15 MeV electrons and 4M 15 MeV photons on the three 
geometries presented. Absolute execution times of 0.269 s, 0.275 s and 0.366 s were found with 
GPUMCD for Geonii, Geom2 and Geom^ respectively for a photon beam and 0.118 s, 0.147 s and 
0.162 s for an electron beam. 



Particles T^ase ^comp ApQQi 

Electrons 0.185 1.06 1.26 
Photons 0.275 0.85 0.98 

TABLE VL Acceleration results of the different strategies presented in Sec. II G 3 for the second 
geometry {geom2)- 



408 As detailed in Sec. IIC3 GPUMCD can take advantage of a system with multiple GPUs. 

409 The impact of parallelizing GPUMCD on two GPUs is presented in Tab. |VII where TPH 

410 (time per history) is the execution time normalized to one complete history including a 

411 primary particle and all its secondary particles, TPHg and TPHd are the execution time in 

412 single and dual GPU mode, respectively. The graphics cards used for this test are a pair of 

413 Tesla C1060. The reported execution times include every memory transfer to and from the 

414 graphics cards. 
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TPHs 


TPHd Ad/s 






ins) 


Electrons 


5.43 


2.82 1.92 


Photons 


5.20 


2.74 1.90 



TABLE VII. Execution times and acceleration factors achieved with a multi-GPU configuration. 



415 IV. DISCUSSION 



416 A. Dosimetric evaluation 



417 Figures [T] - [3] and Tables |T] - |III| present the dosimetric evaluation performed between 

418 GPUMCD and EGSnrc. An excellent agreement is observed for homogeneous phantoms 

419 with both photon and electron beams, except for the end of the build-up region where small 

420 differences of less than 2% can be observed. In both cases, the rest of the curve as well as 

421 the overall range of the particles are not affected. 

422 In heterogeneous simulations with low density materials, the agreement is also excellent. 

423 Differences of up to 2% can be seen in the build-up region for an incoming electron beam. 

424 These differences again do not affect the range of the particles. For a photon beam imping- 

425 ing on the water-lung phantom, the dose is in good agreement within as well as near the 

426 heterogeneity. 

427 In heterogeneous simulations with a high-Z material, differences of up to 2% are found 

428 with an impinging electron beam and no differences above 1% are found for the photon 

429 beam in the presented FDD while differences of up to 1.5% are found in the profiles. 

430 The overall quality of the dose comparison can be evaluated with the gamma value results 

431 presented in Tab. |T]to III Results suggest that the code is suitable for clinical applications 



432 as the dose accuracy is within the 2%/2 mm criteria, for all cases but one, which is below 

433 the recommended calculation accuracy proposed by Van Dyk [47]. For electron beams, the 

434 gamma criteria has been well respected in the first two geometries where at most 2% of the 

435 significant voxels failed. More pronounced differences have been found in the last geometry 

436 including a high-Z material where 8% of the significant voxels fail the gamma test. However, 

437 only 3% of all significant voxels fail that test with a value higher than 1.2, indicating that 
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438 the critera are slightly too strict. Indeed, with a 2.5%-2.5mm, 3% of all significant voxels 

439 fail the test. For photon beams, all geometries meet the gamma criteria for 98 % or more 

440 of all significant voxels. 



441 B. Execution times 

442 The differences in calculation efficiency between GPUMCD and the other two codes 

443 are due to three distinctive reasons: 1) differences in hardware computational power, 2) 

444 differences in the programming model and its adaptation to hardware and 3) differences 

445 in the physics simulated and approximations made. As an operation-wise equivalent CPU 

446 implementation of GPUMCD does not exist, these factors cannot be evaluated individually. 

447 Also, the simulation setup for the test cases uses monoenergetic beams, which can favor 

448 GPU implementations. A naive GPU implementation of polyenergetic beams would lead 

449 to additional stream divergence. However, simple measures such as grouping particles with 

450 similar energy values would reduce this divergence. Future work will explore this issue in 

451 the context of clinical calculations done with polyenergetic beams. 



452 Tab. |IV| shows the absolute execution times for electron and photon beams with GPUMCD. 

453 Execution times of less than 0.2 s are found for all electron cases. The execution time is 

454 higher in the simulation with lung when compared to the simulation in water likely due to 

455 the fact that more interactions are necessary. The execution time is higher in the simulation 

456 with bone because a larger number of fictitious interactions will be encountered. Similar 

457 conclusions can be taken in the cases with photon beams in which the first two simulations 

458 require less than 0.3 s and the simulation with a bone slab required more than 0.36 s. 

459 Tab. |V] shows the acceleration results comparing GPUMCD to DPM and EGSnrc. It 

460 can be seen that GPUMCD is consistently faster than DPM, by a factor of at least 203x 

461 for photon beams and 210x for electron beams. The disadvantage of using the Woodcock 

462 raytracing algorithm is apparent in Tab |V] where the geometry featuring a heavier than 

463 water slab has the lowest acceleration factor. An acceleration of more 1200x is observed 

464 when comparing to EGSnrc for both electron beans and more than 940x for photon beams. 

465 DPM and EGSnrc both employ much more sophisticated electron step algorithms com- 

466 pared to GPUMCD. Improvements to the electron-step algorithm in GPUMCD will most 

467 likely yield greater accelerations, as photon transport was shown to be much faster than the 
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468 reported coupled transport results [48], as well as improved dosimetric results. 

469 Tab. |VI| presents the execution times and acceleration factors for the different divergence 



470 reduction methods and optimizations presented in Sec. II C 3 Gains of up to 26% for 

471 electron beams are observed. However, acceleration factors below 1 are found for photon 

472 beams, showing that electron beams are more divergent. These results show that the GPU 

473 architecture is not ideal for Monte Carlo simulations but that the hardware is able to cope 

474 well with this divergence. The mechanisms employed in GPUMCD to reduce this penalty 

475 have either had a negative impact on the execution time or a small positive impact that is 

476 perhaps not worth the loss in code readability and maintainability. 

477 Tab. I VI I details the gain that can be obtained by executing GPUMCD on multiple 

478 graphics card. GPUMCD presents a linear growth with respect to the number of CPUs. 

479 This is expected as there is a relatively small amount of data transferred to the CPUs 

480 compared to the amount of computations required in a typical simulation. As the number 

481 of histories is reduced, so is the acceleration factor. 

482 Efficiency improvements can be achieved with variance reduction techniques (VRT). Sev- 

483 eral methods have been published to improve the calculation efficiency of EGSnrc for in- 

484 stance [2, 49j. A fair comparison of GPUMCD with other Monte Carlo packages would 

485 ideally involve an optimal usage of the codes. In this paper, no VRT was used for the 

486 comparisons. It is not clear yet how VRT could potentially be implemented in GPUMCD. 

487 In the eventuality that VRT are equally applicable to each code architecture, the efficiency 

488 improvements reported in this paper would remain roughly the same. Otherwise, GPUMCD 

489 might suffer from a performance penalty compared to VRT-enabled codes. 



490 V. CONCLUSION AND FUTURE WORK 

491 In this paper, a new fully coupled GPU-oriented Monte Carlo dose calculation platform 

492 was introduced. The accuracy of the code was evaluated with various geometries and com- 

493 pared to EGSnrc, a thoroughly validated Monte Carlo package. The overall speed of the 

494 platform was compared to DPM, an established fast Monte Carlo simulation package for 

495 dose calculations. For a 2%-2mm gamma criteria, the dose comparison is in agreement for 

496 98% or more of all significant voxels, except in one case: the tissue-bone-lung phantom with 

497 an electron beam where 92% of significant voxels pass the gamma criteria. These results 
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498 suggest that GPUMCD is suitable for clinical use. 

499 The execution speed achieved by GPUMCD, at least two orders of magnitude faster than 

500 DPM, let envision the use of accurate of Monte Carlo dose calculations for numerically 

501 intense applications such as IMRT or arc therapy optimizations. A 15 MeV electron beam 

502 dose calculation in water can be performed in less than 0.12 s for IM histories while a photon 

503 beam calculation takes less than 0.27 s for 4M histories. 

504 GPUMCD is currently under active development. Future work will include the abihty 

505 to work with phase space and spectrum files as well as the integration and validation of 

506 CT-based phantoms for dose calculation. Research towards variance reduction techniques 

507 that are compatible with the GPU architecture is also planned. 
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